Overview

Acute kidney injury (AKI) is one of the most common complications seen in hospitalized patients. The ability to predict which patients are at highest risk for AKI would allow clinicians to intervene and prevent poor patient outcomes. Despite many attempts, however, AKI predicting models fail to perform in way that motivates their use in clinical decision support. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on high-level characteristics such as ICU status, can inform future studies seeking to predict AKI. This study aims to identify how risk factors differ AKI by patient location in the hospital and how modeling approaches differ in their identification of important risk factors.

Introduction

A common condition in hospitalized patients, rates of AKI are estimated to be as high as 7% in hospitalized patients and 30% in ICU patients. (Goyal et al.). AKI is a sudden loss of renal function due to non-renal causes, such as dehydration, reduced blood flow to the kidney, nephrotoxic medications, or sepsis. AKI is usually reversible, but if not addressed can progress to permanent kidney damage and multi-organ failure. Diagnosing AKI early can prevent poor patient outcomes (Goyal et al.). Many groups have attempted to build decision making tools to assist clinicians in predicting AKIs early in the disease stage allowing rapid intervention and preventing sequelae. However, the majority of AKI prediction tools, including many using state of the art machine learning methods on large, diverse datasets, struggle to improve on a trained clinicians ability to identify AKIs (De Vlieger et al). In particular, the black box issue of machine learning algorithms limits the clinicians ability to trust the models, understand what is happening to the patient, and base decisions on model outputs. One way to enhance model transparency, while improving performance is to understand which risk factors are important to model decision making. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on ICU status and admission type can inform future work seeking to predict AKI.

Prior research has built a data set of patients on combination anti-hypertensive with either NSAID or oxycodone to assess AKI risk (Miano et al). This dataset can be re-analyzed to identify which of the studied risk factors contributes most to prediction of AKI. Patients on combination antihypertensives and either NAISDs or oxycodone represent a common subset of hospitalized patients, thus results on this population can be expected to generalize well to all patients. This study seeks to answer the question: in patients receiving treatment for hypertension and pain, do the most predictive risk factors for AKI differ by ICU vs non-ICU status? A secondary question explored by this study is: do predictive models perform better when continuous variables are treated as categorical variables?

AKI prediction is a prominent interdisciplinary space. Clinicians, informaticians, computer scientists, and epidemiologists frequently work together and publish in this space. The work typically combines new machine learning approaches with classic clinical definitions of kidney injury and risk factors. No discipline alone has enough expertise to address the problem, making interdisciplinary collaboration key to success. Conversations with Todd Miano (PharmD, PhD research at University of Pennsylvania’s Department of Biostatistics, Epidemiology, and Bioinformatics (DBEI) and Critical Care Pharmacist) highlighted the difficulty in defining AKI though KDIGO’s multipart definition based on the non-specific marker, serum creatinine. Informatician, John Holmes (Professor of Medical Bioinformatics in Epidemiology also at the University of Pennsylvania DBEI) emphasized the fact that the various prediction models incorporate similar, large sets of risk factors. He suggested that different risk factors my play varying roles for patients in different clinical settings, such as patients in the ICU or post-operative patients. These discussions motivated the proposed study of quantifying predictive power of various risk factors, but splitting the analysis based on admission type and patient location.

A git repo for this project is available here.

Methods

Data

Todd Miano provided the data used in this project, and was previously described in Miano et al. Briefly, data was collected on 27,741 adult patients with greater than 24-hours exposure to one both an anti-hypertensive agent and an analgesic. Ant-hypertensives of interest included either a renin-angiotensin inhibitor (RAS-I) or the calcium channel blocker amlodipine. Analgesics of interest included wither an NSAID or oxycodone. Data was collected between 2014 and 2017 from patients admitted to University of Pennsylvania Health System hospitals using the Penn Data Store warehouse for electronic health records in the health system. Patients were excluded if they had contraindications to NSAIDs, including unresolved AKI within 2 weeks before entry, baseline serum creatinine >2mg/dl (an indicator of renal injury), end stage renal disease, renal replacement therapy, platelet count \(< 100*10^{11}\) (risk of bleed), pregnancy (a contraindication to RAS-I treatment), lack of baseline or follow-up serum creatinine, and history of solid organ transplant. Information was also collected variables associated with AKI, including cardiovascular conditions, diabetes, and cancer. Laboratory values and presence of nephrotoxic medications at entry are also included. In this data set, AKI’s and AKI severity were labeled according to Kidney Disease Improving Global Outcomes (KDIGO) criteria.

In this data set, comorbid conditions and laboratory values were assessed at the point in time where the patient qualified for the study (entry). That is when the patient had greater than 24-hours exposure to the drug-combinations of interest. Thus, when using this data to predict AKI, we necessarily are predicting AKI at the same point in time. Therefore, the prediction methods evaluated in this study attempt to predict AKIs using information available to clinicians up to 24 hours after initiating combination therapy of anti-hypertensives and opioids. The goal of this research is to identify which predictors are the most helpful in predicting AKIs in this setting, and how they differ, if at all, between the medical floor and the ICU.

To do this, the dataset was imported and examined. There were no missing data. However, there were abnormally high and low BMI values. Subjects with BMI <12 and >60 were removed. Next, the data were separated it into 1) the complete dataset, 2) ICU cohort, and 3) the medical floor cohort.

A second set of data with continuous variables categorized by reference ranges was created. The continuous variables BMI, white blood cell count, hemoglobin, platelets, sodium, potassium, chloride, creatinine, and index GFR were categorized based on reference ranges (CDC, Kratz 2022;, Penn Medicine, KDIGO). The categorical variables age was categorized as 18-40, 40-65, and greater than 65. Prior length of stay was grouped as zero days, 1-2 days, 3-7 days, and greater than seven days. Age and prior length of stay ranges were chosen arbitrarily.

Logistic Regression Analysis

Generalized logistic regression models were created for each data sub-set. K-fold cross validation were used to estimate prediction error. K=10 random sub-samples randomly created and K-1=9 subsets were used for training, one subset was used for testing. Area under the ROC and area under the precision recall curves were generated to evaluate model performance.

Predictor importance for logistic regression models were identified based on test-statistic magnitude and statistical significance.

Random Forest Analysis

Next, a random forest model was trained for each data subset. K-fold cross validation were used to estimate prediction error so results could be compared to logistic regression. K=10 random sub-samples randomly created and K-1=9 subsets were used for training, one subset was used for testing. To build the forest, 100 trees were used. Area under the ROC and area under the precision recall curves were generated to evaluate model performance.

Predictor importance was identified based on magnitude of mean decrease in GINI importance scores. Importantly, mean decrease in GINI importance scores are known to be inflated when involving continuous variables and categorical variables with many categories. When calculating GINI scores, the GINI index is calculated for each cut-point in a classification tree. Variables that are continuous or have many categories will have many more potential cut-points than dichotomous variables and therefore have a higher probability of coincidentally producing a better cut-point and getting a higher importance score (Nembrini, 2018; Strobl, 2007). To evaluate the effect of continuous variables on mean decrease in GINI scores for this study, the analysis was repeated using the dataset with all continuous variables categorized according to reference ranges (see above).

Import and Load Packages

# Import packages, if needed
install.packages("dplyr")
install.packages("gtsummary")
install.packages("modelsummary")
# Load packages
library(dplyr)
library(gtsummary)
library(modelsummary)
library(ggplot2)
library(randomForest)
library(tibble)
library(tidyr)
library(stringr)
library(pROC)
library(PRROC)

Read in the Data

Data were uploaded to Zenodo and are available here.

#Load the data
data <- read.csv("/Users/haedi/Repos/BMIN503_Final_Project/BMIN503_Final_Project/miano_thelen_dataset.csv", header = TRUE)

# Drop columns not needed:
# ddiGroup was needed to classify exposure in the prior study, but not needed
# here, nsaidType is missing data for 81% and so was left out. 
# aki_stage, rrt, and mortHosp30 are outcome variables needed in the prior study
# but not here; we are only interested in binary AKI as the outcome variable. 

# Table 1, sorted by ICU stay, variables renamed for readability
data %>%
  select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30)) %>%
  filter(bmi>12 & bmi<60) %>%
  rename(Analgesic = pain, Anithypertensive = bp, AKI = aki, Age = age, 
         Sex = sex, Race = race, BMI = bmi,  "Admission Type" = admType, 
         "Prior Length of Stay" = priorLos, "Perioperative Day" = periOp, 
         Ventilator = vent, "Congestive Heart Failure" = chf, 
         "Chronic Pulmonary Disease" = cpd, 
         HIV = hiv, "Liver Disease" = liver, "periveral Vascular Disease" = pvd,
         "Ceribrovascular Disease" = cva, "Myocardial Infarction" = mif, 
         "Valvular Disease" = valve, Hypertension = htn, 
         "Cardiac Arrhythmias" = arry, 
         "Pulmonary Circulation Disorder" = pCirc, Obesity = obese, 
         "Weight Loss" = wtLoss,
         "Fluid and Electrolyte Disorder"= fluid, "Chronic Kidney Disease" = ckd,
         "Atrial Fibrillation" = afib, "Obstructive Sleep Apnea" = osa, 
         "Diabetes Mellitus" = dm, Cancer = cancer, "Prior AKI" = prior_aki,
         Vancomycin = vanco,
         "Bactrim" = bactrim, "Vasopressors" = pressor, 
         "Other Nephrotoxins" = ntxOther, 
         "Nephrotoxic Antibiotics" = abxNTX, "Diruetics" = diuretic, 
         "Broad Spectrum Antibiotics" = gramNegBroad, 
         "Narrow Spectrum Antibiotics" = gramNegNarrow,
         "WBC x10^9cells/L" = wbc, "Hemoglobin g/dl" = hgb, 
         "Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium, 
         "Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride,
         "Serum Creatinine at entry" = creatinine, "Baseline GFR" = indexGFR) %>%
   mutate(Cancer = factor(Cancer, level = c(0,1,2), 
                          labels = c("None", "Non-metastatic", "Metastatic"))) %>%
  mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
  tbl_summary(by = icu)
Characteristic non-ICU, N = 23,0311 ICU, N = 4,4851
Analgesic
    NSAID 4,667 (20%) 704 (16%)
    Oxycodone 18,364 (80%) 3,781 (84%)
Anithypertensive
    CCB 4,792 (21%) 1,045 (23%)
    RAS 18,239 (79%) 3,440 (77%)
AKI 1,867 (8.1%) 585 (13%)
Age 64 (55, 72) 63 (54, 71)
Sex
    Female 11,304 (49%) 1,694 (38%)
    Male 11,727 (51%) 2,791 (62%)
Race
    Black 8,916 (39%) 1,125 (25%)
    Other / Unk 2,145 (9.3%) 575 (13%)
    White 11,970 (52%) 2,785 (62%)
BMI 30 (26, 35) 29 (25, 34)
Admission Type
    Medicine 9,335 (41%) 1,039 (23%)
    Surgery 13,696 (59%) 3,446 (77%)
Prior Length of Stay 2 (1, 3) 2 (1, 4)
Perioperative Day
    Not post-op 17,474 (76%) 2,860 (64%)
    POD one 2,503 (11%) 978 (22%)
    POD three 736 (3.2%) 173 (3.9%)
    POD two 1,907 (8.3%) 371 (8.3%)
    POD zero 411 (1.8%) 103 (2.3%)
Ventilator 340 (1.5%) 682 (15%)
Congestive Heart Failure 5,696 (25%) 1,799 (40%)
Chronic Pulmonary Disease 6,411 (28%) 1,478 (33%)
HIV 347 (1.5%) 37 (0.8%)
Liver Disease 1,184 (5.1%) 212 (4.7%)
periveral Vascular Disease 3,429 (15%) 1,092 (24%)
Ceribrovascular Disease 2,259 (9.8%) 646 (14%)
Myocardial Infarction 3,251 (14%) 1,206 (27%)
Valvular Disease 3,476 (15%) 1,232 (27%)
Hypertension 20,664 (90%) 3,777 (84%)
Cardiac Arrhythmias 4,477 (19%) 1,359 (30%)
Pulmonary Circulation Disorder 2,078 (9.0%) 640 (14%)
Obesity 5,161 (22%) 871 (19%)
Weight Loss 1,418 (6.2%) 429 (9.6%)
Fluid and Electrolyte Disorder 5,783 (25%) 1,640 (37%)
Chronic Kidney Disease 2,909 (13%) 602 (13%)
Atrial Fibrillation 3,913 (17%) 1,181 (26%)
Obstructive Sleep Apnea 3,362 (15%) 664 (15%)
Diabetes Mellitus
    DM-Comp 2,004 (8.7%) 339 (7.6%)
    DM-Non-comp 6,454 (28%) 1,209 (27%)
    None 14,573 (63%) 2,937 (65%)
Cancer
    None 18,747 (81%) 3,851 (86%)
    Non-metastatic 2,912 (13%) 441 (9.8%)
    Metastatic 1,372 (6.0%) 193 (4.3%)
Prior AKI
    None 20,906 (91%) 3,748 (84%)
    Yes-resolved 2,125 (9.2%) 737 (16%)
Vancomycin 4,702 (20%) 1,555 (35%)
Bactrim 713 (3.1%) 66 (1.5%)
Vasopressors 541 (2.3%) 643 (14%)
Other Nephrotoxins 526 (2.3%) 64 (1.4%)
Nephrotoxic Antibiotics 555 (2.4%) 363 (8.1%)
Diruetics 9,193 (40%) 2,298 (51%)
Broad Spectrum Antibiotics 2,717 (12%) 641 (14%)
Narrow Spectrum Antibiotics 9,072 (39%) 2,047 (46%)
WBC x10^9cells/L 8.9 (6.9, 11.4) 10.8 (8.3, 14.0)
Hemoglobin g/dl 11.00 (9.60, 12.40) 11.00 (9.60, 12.40)
Platelets x10^11/L 221 (173, 282) 194 (149, 251)
Sodium, mEq/L 138.0 (135.0, 140.0) 138.0 (136.0, 140.0)
Potassium, mEq/L 4.10 (3.80, 4.40) 4.10 (3.80, 4.40)
Chloride mEq/L 103.0 (101.0, 106.0) 105.0 (102.0, 109.0)
Serum Creatinine at entry 0.90 (0.72, 1.10) 0.90 (0.71, 1.08)
Baseline GFR 72 (57, 88) 75 (59, 89)
1 n (%); Median (IQR)
# Mutate binary variables to factors, keep only variables we will use
data.c.f <- data %>%
  select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30,)) %>%
  filter(bmi>12 & bmi<60) %>%
  mutate(across('race', str_replace, 'Other / Unk', 'Other')) %>%
  mutate(aki = factor(aki, level = c(0,1), labels = c("No AKI", "AKI"))) %>%
  mutate(cancer = factor(cancer, level = c(0,1,2), 
                         labels = c("no_cancer", "non-metastatic", "metastatic"))) %>%
   mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
  mutate_at(c("pain", "bp", "sex", "race", "admType", "periOp", "vent", "chf", 
              "cpd", "hiv","liver", "pvd", "cva", "mif", "valve", "htn", "arry",
              "pCirc", "obese", "wtLoss", "fluid", "ckd", "afib", "osa", "dm", 
              "prior_aki", "vanco", "bactrim", "pressor","ntxOther", "abxNTX",
              "diuretic","gramNegBroad", "gramNegNarrow"), factor) %>%
  relocate(where(is.factor))%>%
  relocate(aki)

# Break up the datasets to ICU and Non-ICU cohorts 
icu <- data.c.f %>%
  filter(icu=="ICU") %>%
  select(!c(icu))

nonicu <- data.c.f%>%
  filter(icu=="non-ICU")%>%
  select(!c(icu))

Table 1 shows patient characteristics in the ICU vs non-ICU.

In Table 1 we see first that AKIs are a rare event and that they occurred at a greater proportion in the ICU than in the non-ICU group (13% vs 8%, respectively). It is important to note that the ICU group is much smaller than the non-ICU group. Certain risk factors were more common in the ICU than in the non-ICU group, such as nephrotoxic antibiotics, prior AKI, congestive heart failure, and post-op day 1.

Next, we build the categorical dataset.

# Categorizing the continuous variables using reference ranges
data.c.f.cat = data.c.f %>%
  mutate(age = cut(age, breaks = c(18, 40, 65, Inf), 
                        labels = c("18-40", "41-65", ">65"),
                        include.lowest = T)) %>%
  mutate(bmi = cut(bmi, breaks = c(-Inf, 18.5, 25, 30, Inf), # CDC
                        labels = c("Underweight", "Healthy-weight", 
                                   "Overweight","Obesity"))) %>%
  mutate(priorLos = cut(priorLos, breaks = c(-Inf, 0, 2, 7, Inf),
                            labels = c("0", "1-2", "3-7", ">7"),
                            inclue.lowest = F)) %>%
  mutate(wbc = cut(wbc, breaks = c(-Inf, 3.5, 9.1, Inf), # Harrison's
                            labels = c("low", "ref", "high"))) %>%
  mutate(hgb = cut(hgb, breaks = c(-Inf,12, 16, Inf), # Harrison's, combo M&F
                        labels = c("low", "ref", "high"))) %>%
  mutate(platelets = cut(platelets, breaks = c(-Inf, 149, 400, Inf),# PennMed
                                    labels = c("low", "ref", "high"))) %>%
  mutate(sodium = cut(sodium, breaks = c(-Inf, 135, 144, Inf),# PennMed
                              labels = c("low", "ref", "high"))) %>%
  mutate(potassium = cut(potassium, breaks = c(-Inf, 3.5, 5.1, Inf),# PennMed
                                    labels = c("low", "ref", "high"))) %>%
  mutate(chloride = cut(chloride, breaks = c(-Inf, 100, 111, Inf),# PennMed
                                  labels = c("low", "ref", "high"))) %>%
  mutate(creatinine = cut(creatinine, breaks = c(-Inf, 0.49, 1.2, Inf), 
                          # Harrison's, combo M&F
                                  labels = c("low", "ref", "high"))) %>%
  mutate(indexGFR = cut(indexGFR, breaks = c(-Inf, 15, 30, 45, 60, 90, Inf),
                            labels = c("G5", "G4", "G3b", "G3a", "G2", "G1")))

# create ICU and non-ICU subsets
icu.cat <- data.c.f.cat %>%
  filter(icu=="ICU") %>%
  select(!c(icu))

nonicu.cat <- data.c.f.cat%>%
  filter(icu=="non-ICU")%>%
  select(!c(icu))

# Table 2, continuous variables categorized
data.c.f.cat %>%
  select(c(age, bmi, priorLos, wbc, hgb, platelets, sodium, potassium, 
           chloride, creatinine, indexGFR, icu)) %>%
  rename( Age = age, BMI = bmi, "Prior Length of Stay" = priorLos, 
          "WBC x10^9cells/L" = wbc, "Hemoglobin g/dl" = hgb, 
          "Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium, 
          "Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride, 
          "Serum Creatinine at entry" = creatinine, "Baseline eGFR" = indexGFR) %>%
  tbl_summary(by = icu)
Characteristic non-ICU, N = 23,0311 ICU, N = 4,4851
Age
    18-40 1,029 (4.5%) 256 (5.7%)
    41-65 11,749 (51%) 2,391 (53%)
    >65 10,253 (45%) 1,838 (41%)
BMI
    Underweight 545 (2.4%) 115 (2.6%)
    Healthy-weight 4,544 (20%) 1,017 (23%)
    Overweight 6,640 (29%) 1,404 (31%)
    Obesity 11,302 (49%) 1,949 (43%)
Prior Length of Stay
    0 1,580 (6.9%) 273 (6.1%)
    1-2 14,317 (62%) 2,494 (56%)
    3-7 5,390 (23%) 1,253 (28%)
    >7 1,744 (7.6%) 465 (10%)
WBC x10^9cells/L
    low 364 (1.6%) 13 (0.3%)
    ref 11,437 (50%) 1,454 (32%)
    high 11,230 (49%) 3,018 (67%)
Hemoglobin g/dl
    low 16,087 (70%) 3,067 (68%)
    ref 6,768 (29%) 1,384 (31%)
    high 176 (0.8%) 34 (0.8%)
Platelets x10^11/L
    low 3,153 (14%) 1,127 (25%)
    ref 18,312 (80%) 3,176 (71%)
    high 1,566 (6.8%) 182 (4.1%)
Sodium, mEq/L
    low 5,913 (26%) 1,026 (23%)
    ref 16,814 (73%) 3,247 (72%)
    high 304 (1.3%) 212 (4.7%)
Potassium, mEq/L
    low 2,643 (11%) 485 (11%)
    ref 20,010 (87%) 3,930 (88%)
    high 378 (1.6%) 70 (1.6%)
Chloride mEq/L
    low 5,387 (23%) 761 (17%)
    ref 16,986 (74%) 3,188 (71%)
    high 658 (2.9%) 536 (12%)
Serum Creatinine at entry
    low 446 (1.9%) 148 (3.3%)
    ref 19,138 (83%) 3,736 (83%)
    high 3,447 (15%) 601 (13%)
Baseline eGFR
    G5 0 (0%) 0 (0%)
    G4 308 (1.3%) 35 (0.8%)
    G3b 2,174 (9.4%) 384 (8.6%)
    G3a 4,328 (19%) 766 (17%)
    G2 11,113 (48%) 2,216 (49%)
    G1 5,108 (22%) 1,084 (24%)
1 n (%)

Table 2 shows categorized continuous variables for patient characteristics in the ICU vs non-ICU.

In the categorical data, we see that decreased platelets and elevated white blood cells, a sign of infection, were more common in the ICU group.

Now that the datasets are created, we can build the models and identify the variables that have the best predictive value. We first evaluate the complete, ICU, and non-ICU datasets for both logistic regression and random forest with continuous variables. We then repeat the analysis using categorical variables only.

Results

Visualizing Data

From Table 1, we can see that AKI rates are higher in the ICU. Next, we visualize the relationship between ICU stay and AKI.

ggplot(data = data.c.f, aes(x=icu, fill = aki)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of study patients with AKI by location", 
       x = "ICU Status", y ="Proporiton" )

Figure 1 Proportion of study patients with AKI in the ICU vs non-ICU.

Using a Chi-square test, we can check if there is a statistically significant difference in AKI proportion among the ICU and non-ICU groups.

# Chi-Square test
chisq.test(table(data.c.f$aki, data.c.f$icu))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data.c.f$aki, data.c.f$icu)
## X-squared = 112.12, df = 1, p-value < 2.2e-16

At the \(\alpha = 0.05\) level, we can conclude that in our study population, the AKI proportion is different between non-ICU and ICU groups. This is helpful contextual information for our data, though, it is not part of the main analysis.

Part 1: Analysis with Continous Variables included as continuous

Logistic Regression and Random Forest Models

Next we create a logistic regression and random forest models for all three data subsets (complete, ICU, and non-ICU), then compare the predictors.

Model Building for Complete Dataset

N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(data.c.f, s != i)
    test <- filter(data.c.f, s == i)
    obs.outputs[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions <- predict(glm.aki, test, type = "response")
  pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
  
    #RF train/test
    rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr <- predict(rf, newdata = test, type = "prob") 
    pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
    
    offset <- offset + length(s[s == i])
}
# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
## 
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6924
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf,     ci = TRUE)
## 
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6853
## 95% CI: 0.6743-0.6963 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy",  add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 2 ROC Curve for the Complete Dataset with Continuous Data

# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"], 
               pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)

Figure 3 PR Curve for the Logistic Regression Model for the Complete Dataset with Continuous Data

prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"], 
               pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)

Figure 4 PR Curve for the Random Forest Model for the Complete Dataset with Continuous Data

Figures 2-4 show that the logistic regression and random forest models perform similarly when compared on the area under ROC curve and area under precision recall (PR) curve for the complete dataset with continuous data included. The performance on the ROC curve is more optimistic than the PR curve because our the data are highly skewed (AKI is a rare event). Thus model performance is better assessed on the PR curves.

Mobel Building for ICU Dataset

N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(icu, s != i)
    test <- filter(icu, s == i)
    obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
  pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
  
    #RF train/test
    rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob") 
    pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
## 
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6651
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu,     ci = TRUE)
## 
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6481
## 95% CI: 0.6244-0.6718 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 5 ROC Curves for the ICU Dataset with Continuous Data

# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"], 
               pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)

Figure 6 PR Curve for the Logistic Regression Model for the ICU Dataset with Continuous Data

prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"], 
               pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)

Figure 7 PR Curve for the Random Forest Model for the ICU Dataset with Continuous Data

Figures 5-7 show model performance for the logistic and random forest models trained on the ICU data subset with continuous data. Performance was increased slightly over the complete dataset on both the ROC and PR curves.

Mobel Building for non-ICU Dataset

N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(nonicu, s != i)
    test <- filter(nonicu, s == i)
    obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
  pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
  
    #RF train/test
    rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob") 
    pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
## 
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.687
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu,     ci = TRUE)
## 
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6765
## 95% CI: 0.6638-0.6893 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 8 ROC Curves for the non-ICU Dataset with Continuous Data

# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)

Figure 9 PR Curve for the Logistic Regression Model for the non-ICU Dataset with Continuous Data

prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)

Figure 10 PR Curve for the Random Forest Model for the non-ICU Dataset with Continuous Data

Figures 8-10 show model performance for the logistic and random forest models trained on the non-ICU data subset with continuous data. Performance decreased slightly over the complete dataset on both the ROC and PR curves.

Identifying Top Predictors

Next, we identify the top predictors for each of the above models. First, identify top predictors for the complete dataset.

# Identify top predictors for the combined data set
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z 
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature") +
  theme(axis.text.y = element_text(size = 6))

Figure 11 Relative Importance in the Logistic Regression Model for the Complete Dataset with Continuous Data

#Get 10 Top Predictors by GLM
glm.aki.pvals <- data.frame(glm.aki.sum$coefficients[,4])
top.glm.aki.pvals <- glm.aki.pvals %>%
  filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
  arrange(glm.aki.sum.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 12 Relative Importance in the Random Forest Model for the Complete Dataset with Continuous Data

Figures 11 and 12 show the ranking of importance for each variable according to the logistic regression and random forest models for the complete dataset. The most important predictors are different between the two model types. However, for the mean decrease in GINI scores, the top predictors happen to be the continuous variables and they appear to be much more important than all of the categorical variables. This is an indicator that they mean decease in GINI might be artificially inflating their importance.

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat the process for ICU subset.

# Identify top predictors for the ICU subset
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z 
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature") +
  theme(axis.text.y = element_text(size = 6))

Figure 13 Relative Importance in the Logistic Regression Model for the ICU-Dataset with Continuous Data

#Get 10 Top Predictors by GLM
glm.aki.pvals.icu <- data.frame(glm.aki.sum.icu$coefficients[,4])
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
  filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
  arrange(glm.aki.sum.icu.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature") +
  theme(axis.text.y = element_text(size = 6))

Figure 14 Relative Importance in the Random Forest Model for the ICU-Dataset with Continuous Data

Figures 13 and 14 show that the random forest and logistic regression models found different variables to be the most important on the ICU data subset. When compared to the complete dataset, the logistic regression model found nephrotoxic antibiotics to be the most important in the ICU subset, where in the complete dataset, diuretic use was the most important. The mean decrease in GINI scores still appear to be artificially inflated for the continuous variables.

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat the process for non-ICU subset.

# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z 
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 15 Relative Importance in the Logistic Regression Model for the non-ICU-Dataset with Continuous Data

#Get 10 Top Predictors by GLM
glm.aki.pvals.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,4])
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
  filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
  arrange(glm.aki.sum.nonicu.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 16 Relative Importance in the Random Forest Model for the non-ICU-Dataset with Continuous Data

Figures 15 and 16 show that the random forest and logistic regression models found different variables to be the most important on the non-ICU data subset as well. When compared to the complete dataset, the logistic regression model found fluid status to be the most important in the ICU subset, where in the complete dataset, diuretic use was the most important. Again, the mean decrease in GINI scores appear to be artificially inflated for the continuous variables.

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Make a table that compare the top predictors for the different methods and the different datasets.

# Make a dataframe with all the results

top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu),  rownames(top.glm.aki.pvals.icu)) %>%
  rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval
##     combined       non-ICU             ICU
## 1  diuretic1        fluid1         abxNTX1
## 2     fluid1     diuretic1            ckd1
## 3       ckd1          ckd1             bmi
## 4   bactrim1      bactrim1       ntxOther1
## 5        age           age             age
## 6     icuICU          chf1       diuretic1
## 7    abxNTX1       wtLoss1            chf1
## 8       chf1    creatinine        priorLos
## 9        bmi periOpPOD two           afib1
## 10      pvd1        liver1 periOpPOD three

Table 3 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets

top.compare.gini <-data.frame(top.rf.gini$feature,  top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini
##    top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1             indexGFR                   indexGFR                indexGFR
## 2                  bmi                        bmi              creatinine
## 3           creatinine                 creatinine                     bmi
## 4            platelets                  platelets               platelets
## 5                  wbc                        wbc               potassium
## 6                  hgb                        hgb                     age
## 7                  age                        age                     wbc
## 8             chloride                   chloride                     hgb
## 9            potassium                  potassium                chloride
## 10              sodium                     sodium                  sodium

Table 4 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets

The logistic regression models found different variables to be the most important by ICU-status when continuous data was allowed to remain in the model. For the random forest, however, the top predictors and their ranks were almost the same in the different data subset. The most important features in the random forest models all happen to be exclusively the continuous variables. Thus we turn to the next phase in the analysis, where we train random forest models on categorical data. To enable fair comparison to regression models, the regression models were also re-trained on the categorized variables.

Part 2: Analysis with Continous Variables Included as Categorized

Logistic Regression and Random Forest Models

Model Building for Complete Dataset

Analysis of the continuous variables as categorized data.

N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(data.c.f.cat, s != i)
    test <- filter(data.c.f.cat, s == i)
    obs.outputs[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions <- predict(glm.aki, test, type = "response")
  pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
  
    #RF train/test
    rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr <- predict(rf, newdata = test, type = "prob") 
    pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
    
    offset <- offset + length(s[s == i])
}
# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
## 
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.7051
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf,     ci = TRUE)
## 
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6808
## 95% CI: 0.6699-0.6917 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 17 ROC Curves for the Complete Dataset with Categorical Data

# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"], 
               pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)

Figure 18 PR Curves for the Logistic Regression on the Complete Dataset with Categorical Data

prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"], 
               pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)

Figure 19 PR Curves for the Random Forest Model on the Complete Dataset with Categorical Data

Figures 17-19 show the performance of the logistic regression and random forest models when trained on the complete dataset with only categorical variables. The overall model performance was similar to the performance when the models were trained on data containing continuous variables (see figures 2-4).

Mobel Building for ICU Dataset

N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(icu.cat, s != i)
    test <- filter(icu.cat, s == i)
    obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
  pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
  
    #RF train/test
    rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob") 
    pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
## 
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6838
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu,     ci = TRUE)
## 
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6737
## 95% CI: 0.6512-0.6961 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 20 ROC Curves for the ICU Dataset with Categorical Data

# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"], 
               pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)

Figure 21 PR Curves for the Logistic Regression Model on the ICU Dataset with Categorical Data

prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"], 
               pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)

Figure 22 PR Curves for the Random Forest Model on the ICU Dataset with Categorical Data

Figures 21-22 show the model performance for the random forest and logistic regression when trained on categorical data only for the ICU subset. The performance here is similar to models trained on continuous plus categorical data.

Mobel Building for non-ICU Dataset

N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
    train <-filter(nonicu.cat, s != i)
    test <- filter(nonicu.cat, s == i)
    obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
    
    #GLM train/test
    glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
  glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
  pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
  
    #RF train/test
    rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
    rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob") 
    pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
    
    offset <- offset + length(s[s == i])
}
#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
## 
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.698
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu,     ci = TRUE)
## 
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6723
## 95% CI: 0.6597-0.6849 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)

Figure 23 ROC Curves for the non-ICU Dataset with Categorical Data

# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)

Figure 24 PR Curve for the Logistic Regression Model on the non-ICU Dataset with Categorical Data

prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"], 
               pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)

Figure 25 PR Curve for the Random Forest Model on the non-ICU Dataset with Categorical Data

Figures 23-25 show the model performance for the random forest and logistic regression when trained on categorical data only for the non-ICU subset. The performance here is similar to models trained on continuous plus categorical data.

Identifying Top Predictors

# Identify top predictors for the combined data set with categorized variables
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z 
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 26 Relative Importance in the Logistic Regression Model for the Complete Dataset with Categorical Data

#Get 10 Top Predictors by GLM
glm.aki.pvals <- data.frame(glm.aki.sum$coefficients[,4])
top.glm.aki.pvals <- glm.aki.pvals %>%
  filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
  arrange(glm.aki.sum.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 27 Relative Importance in the Random Forest Model for the Complete Dataset with Categorical Data

Figures 26 and 27 show the ranking of importance for each variable according to the logistic regression and random forest models for the complete dataset trained on categorical-only data. The most important predictors are different between the two model types.

The mean decrease in GINI scores show a variety of continuous and categorical variables among the highest ranked. The scores no longer appear inflated for the continuous variables.

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat process for ICU subset with categorized data.

# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z 
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 28 Relative Importance in the Logistic Regression Model for the ICU Dataset with Categorical Data

#Get 10 Top Predictors by GLM
glm.aki.pvals.icu <- data.frame(glm.aki.sum.icu$coefficients[,4])
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
  filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
  arrange(glm.aki.sum.icu.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 29 Relative Importance in the Random Forest Model for the ICU Dataset with Categorical Data

Figures 28 and 29 show that the random forest and logistic regression models found different variables to be the most important on the ICU data subset when trained on categorical only data. When compared to the complete dataset, the logistic regression model found normal range of serum creatinine be the most important in the ICU subset (greatest magnitude in Z score) and in the complete dataset. These top predictors are different than the top predictors for the categorical with continuous data results (compare figures 13 and 29).

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Repeat process for non-ICU subset with categorized data.

# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z 
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Positively Associated", "Negatively Associated"))+
  labs(y="Relative Importance (z-score)", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 30 Relative Importance in the Logistic Regression Model for the non-ICU Dataset with Categorical Data

#Get 10 Top Predictors by GLM
glm.aki.pvals.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,4])
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
  filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
  arrange(glm.aki.sum.nonicu.coefficients...4.) %>%
  slice_head(n=10)
# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip()+
  labs(y="Mean Decrease in GINI Score", x = "Feature")+
  theme(axis.text.y = element_text(size = 6))

Figure 31 Relative Importance in the Random Forest Model for the non-ICU Dataset with Categorical Data

Figures 30 and 31 show that the random forest and logistic regression models found different variables to be the most important on the non-ICU data subset when trained on categorical only data. When compared to the complete dataset, the logistic regression model here also found normal serum creatinine to be the most important predictor (by greatest magnitude in Z score). These top predictors match the top predictors for the categorical with continuous data results (compare figures 15 and 30).

# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
  arrange(desc(MeanDecreaseGini)) %>%
  slice_head(n=10)

Make a table that compare the top predictors for the different methods and the different datasets using the categorized data.

# Make a dataframe with all the results
top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu),  rownames(top.glm.aki.pvals.icu)) %>%
  rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval
##          combined        non-ICU            ICU
## 1   creatinineref  creatinineref  creatinineref
## 2  creatininehigh         fluid1 creatininehigh
## 3       diuretic1      diuretic1        abxNTX1
## 4        bactrim1 creatininehigh      diuretic1
## 5          fluid1       bactrim1           ckd1
## 6      priorLos>7           ckd1     priorLos>7
## 7            ckd1           chf1      ntxOther1
## 8          icuICU     priorLos>7           chf1
## 9     priorLos3-7      raceWhite     indexGFRG2
## 10        abxNTX1  periOpPOD two     indexGFRG1

Table 5 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets using Categorical Data

top.compare.gini <-data.frame(top.rf.gini$feature,  top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini
##    top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1             indexGFR                        bmi                indexGFR
## 2                  bmi                   indexGFR                     bmi
## 3             priorLos                   priorLos                priorLos
## 4                 race                       race                  periOp
## 5                   dm                         dm                    race
## 6                  age                        age                      dm
## 7             chloride                        wbc                chloride
## 8               periOp                   chloride                     age
## 9            platelets                  platelets               platelets
## 10                 wbc                     sodium                  sodium

Table 6 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets Using Categorical Data

The logistic regression models found different variables to be the most important by ICU-status when continuous data was categorized. These results differ from the dataset that included continuous predictors (compare table 3 and table 5).

Top predictors for mean decrease in GINI scores differ between ICU and non-ICU status. Though variables were more consistent between the datasets for the random forest models than for the logistic regression models.

Discussion

The top predictors for AKI were different depending on ICU status in the logistic regression models when either categorical only or categorical with continuous data were used for training. The top predictors were different depending on ICU status in the random forest model when categorical data were used.

The results of this study also demonstrate the well-known need to carefully prepare variables for analysis in the random forest models and logistic regression models. Top predictors varied not only by ICU status, and by prediction model (regression versus random forest), but also by data-type (categorical versus continuous). A skillful analysis of the importance of each individual variable is warranted for inclusion in AKI prediction models depending on the type of model and for deciding how to handle the variable (eg. whether to categorize it).

The implications of this study are that AKI prediction models need to consider the patient specific factors such as hospital location when building models. Future work needs to evaluate the effect of sample size on top-predictor analysis and why there is such variability in model type. Finally, incorporating variable selection methods in the logistic regression model is a clear next step.

References

  1. Goyal A, Daneshpajouhnejad P, Hashmi MF, et al. Acute Kidney Injury. [Updated 2022 Aug 18]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK441896/.

  2. CDC. “About Adult BMI.” CDC.Gov (2022). Accessed November 2022. https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html#InterpretedAdults.

  3. De Vlieger, Greeta; Kashani, Kianoushb; Meyfroidt, Geerta. Artificial intelligence to guide management of acute kidney injury in the ICU: a narrative review. Current Opinion in Critical Care: December 2020 - Volume 26 - Issue 6 - p 563-573 doi: 10.1097/MCC.0000000000000775.

  4. Kratz A, Pesce MA, Basner RC, Einstein AJ. Laboratory Values of Clinical Importance. In: Kasper D, Fauci A, Hauser S, Longo D, Jameson J, Loscalzo J. eds. Harrison’s Principles of Internal Medicine, 19e. McGraw Hill; 2014. Accessed November 23, 2022. https://accessmedicine-mhmedical-com.proxy.library.upenn.edu/content.aspx?bookid=1130&sectionid=79722706

  5. Miano, Todd A., et al. “Effect of renin-angiotensin system inhibitors on the comparative nephrotoxicity of NSAIDs and opioids during hospitalization.” Kidney360 1.7 (2020): 604.

  6. Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics. 2018 Nov 1;34(21):3711-3718. doi: 10.1093/bioinformatics/bty373. PMID: 29757357; PMCID: PMC6198850.

  7. Penn Medicine. University of Pennsylvania Health System Test Directory. Accessed November 2022. https://www.testmenu.com/UPHS.

  8. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007 Jan 25;8:25. doi: 10.1186/1471-2105-8-25. PMID: 17254353; PMCID: PMC1796903.